‘Facebook friendship visualization by Paul Butler’
Often times geospatial data is best understood as exchanges or interactions. Peter Butler’s popular visualization of Facebook friendships provides a powerfully illustrates both the global popularity Facebook and some of the unique societal connections enabled by the social networking giant. Facebook in many ways has removed the need for geospatial proximity to maintain friendships.
Many forms of geospatial data are best expressed as interactions such as: financial transactions, SMS messages, shipping routes, etc. This tutorial will illustrate how to visualize geospatial networks at scale using R and the skopeABI package.
In this tutorial we will visualize 5007 flights available from FlowingData(Yau 2015) which we have made available within the skopeABI package as raw_flight_data. We will organize the tutorial using our standard four phase analytic pipeline doctrine:
For this tutorial we will use flight data available from FlowingData which we have made available within the skopeABI package as raw_flight_data.
airports <- read.csv("http://datasets.flowingdata.com/tuts/maparcs/airports.csv",
stringsAsFactors=FALSE,header=TRUE)
flights <- read.csv("http://datasets.flowingdata.com/tuts/maparcs/flights.csv",
stringsAsFactors=FALSE,header=TRUE, as.is=TRUE)The raw data consists of two data frames. airports consists of 7 attributes associated with 3379 airports within the united states:
airports=raw_flight_data$airports
str(airports)## 'data.frame': 3379 obs. of 7 variables:
## $ iata : chr "00M" "00R" "00V" "01G" ...
## $ airport: chr "Thigpen " "Livingston Municipal" "Meadow Lake" "Perry-Warsaw" ...
## $ city : chr "Bay Springs" "Livingston" "Colorado Springs" "Perry" ...
## $ state : chr "MS" "TX" "CO" "NY" ...
## $ country: chr "USA" "USA" "USA" "USA" ...
## $ lat : num 32 30.7 38.9 42.7 30.7 ...
## $ long : num -89.2 -95 -104.6 -78.1 -81.9 ...
The second item in the raw_flight_data list is a flights data frame consisting of 5007 flights expressed using 4 attributes:
flights=raw_flight_data$flights
str(flights)## 'data.frame': 5007 obs. of 4 variables:
## $ airline : chr "AA" "AA" "AA" "AA" ...
## $ airport1: chr "DFW" "MSP" "LGA" "TPA" ...
## $ airport2: chr "SJU" "DFW" "ORD" "JFK" ...
## $ cnt : int 120 326 860 56 44 550 496 200 214 50 ...
Data cleaning and transformation is often needed to make our data most useful. Although the data cleaning for this tutorial is already complete, we do need to make several transformations. We first need to generate a graph object using the igraph package. This will enable us to conduct network analysis of our flight data. We then have to convert the flights edgelist to contain the corresponding lat/lon for each airport expressed in a given edge. Finally we need to generate arc data in long format for visualization.
Several flights associated with airports in East Asia cross the anti-meridian causing problems with out map boundaries.
## airport lat long
## 2797 Prachinburi 14.078333 101.3783
## 2798 Babelthoup/Koror 7.367222 134.5442
## 3004 Tinian International Airport 14.996111 145.6214
## 3359 Yap International 9.516700 138.1000
## country
## 2797 Thailand
## 2798 Palau
## 3004 N Mariana Islands
## 3359 Federated States of Micronesia
For the purpose of this tutorial we will simply remove these flights; however, we will address geospatial networks that cross the anti-meridian in a subsequent tutorial.
# remove airports that are not inside the United States
airports=airports[airports$country == 'USA',]
str(airports)## 'data.frame': 3375 obs. of 7 variables:
## $ iata : chr "00M" "00R" "00V" "01G" ...
## $ airport: chr "Thigpen " "Livingston Municipal" "Meadow Lake" "Perry-Warsaw" ...
## $ city : chr "Bay Springs" "Livingston" "Colorado Springs" "Perry" ...
## $ state : chr "MS" "TX" "CO" "NY" ...
## $ country: chr "USA" "USA" "USA" "USA" ...
## $ lat : num 32 30.7 38.9 42.7 30.7 ...
## $ long : num -89.2 -95 -104.6 -78.1 -81.9 ...
# remove flights associated with non-US airports
flights=flights[flights$airport1 %in% airports$iata & flights$airport2 %in% airports$iata,]
str(flights)## 'data.frame': 5005 obs. of 4 variables:
## $ airline : chr "AA" "AA" "AA" "AA" ...
## $ airport1: chr "DFW" "MSP" "LGA" "TPA" ...
## $ airport2: chr "SJU" "DFW" "ORD" "JFK" ...
## $ cnt : int 120 326 860 56 44 550 496 200 214 50 ...
Our first step is to generate an empty graph and then assign our 7 attributes to the vertices of the graph:
G=graph.empty(n=length(airports$iata),directed=TRUE)
V(G)$name=airports$iata
V(G)$airport=airports$airport
V(G)$lat=airports$lat
V(G)$lon=airports$lon
V(G)$city=airports$city
V(G)$state=airports$state
V(G)$country=airports$country
vcount(G)## [1] 3375
We then add flights as edges along with 2 additional attributes cnt and airline. We label cnt as weight to facilitate weighted graph functionality within the igraph package.
G=add_edges(G,edges=rbind(flights$airport1,flights$airport2))
E(G)$weight=flights$cnt
E(G)$airline=flights$airline
# delete vertices with no flight data
G=delete_vertices(G,v=degree(G,mode='total')==0)
vcount(G)## [1] 282
ecount(G)## [1] 5005
Similar to the raw_flight_data object. We make the graph above available within skopeABI as flight_graph
Currently our graph, G, has its edges expressed by iata.
head(get.edgelist(G,names=TRUE))## [,1] [,2]
## [1,] "DFW" "SJU"
## [2,] "MSP" "DFW"
## [3,] "LGA" "ORD"
## [4,] "TPA" "JFK"
## [5,] "STT" "BOS"
## [6,] "PHX" "DFW"
However, to generate arcs for our visualization, we will need the lat/lon for the origin and destination airports for each flight. For this we provide the buildEdgeCoordDF function:
arc_point_df=buildEdgeCoordDF(G,attributes=list(airline=E(G)$airline,weight=E(G)$weight))
head(arc_point_df)## source target airline weight sourceLat sourceLon targetLat targetLon
## 1 DFW SJU AA 120 32.89595 -97.03720 18.43942 -66.00183
## 2 MSP DFW AA 326 44.88055 -93.21692 32.89595 -97.03720
## 3 LGA ORD AA 860 40.77724 -73.87261 41.97960 -87.90446
## 4 TPA JFK AA 56 27.97547 -82.53325 40.63975 -73.77893
## 5 STT BOS AA 44 18.33731 -64.97336 42.36435 -71.00518
## 6 PHX DFW AA 550 33.43417 -112.00806 32.89595 -97.03720
To express these flights as arcs on a map requires us to enrich and transform the endpoint depicted in our flight_data data frame. Although several tutorials provide examples of plotting relatively small transactional data sets in R, they often use loops to generate each are as an individual layer which will cause problems with batch limits at scale (for example see this RBloggers Tutorial (“Create Air Travel Route Maps in Ggplot: A Visual Travel Diary. R-Bloggers” 2017)). This limitation is easily overcome by converting our data structure from wide to long format (see the RCookbook (“Converting Data Between Wide and Long Format” 2019)). We provide the arcMelt function to handle arc generation and visualization at scale.
# Build melted arc data
arc_data=arcMelt(arc_point_df,n=5,cores=1,weighted=TRUE)
head(arc_data)## lon lat index weight
## 1: -97.03720 32.89595 1 120
## 2: -91.30580 30.99844 1 120
## 3: -85.81177 28.85820 1 120
## 4: -80.54996 26.50462 1 120
## 5: -75.50737 23.96607 1 120
## 6: -70.66540 21.26935 1 120
In addition to the arc_data data.frame, we need to load the necessary shape files to construct our map layer. We provide wrapper around the package rnaturalearth which provides a map of countries of the entire world. Use ne_countries to pull country data and choose the scale (rnaturalearthhires is necessary for scale = “large”). The function can return sp classes (default) or directly sf classes, as defined in the argument returnclass. The code below provides a based worldmap.
#construct map w/rnaturalearth and sf
world <- ne_countries(scale = "medium", returnclass = "sf")
m=ggplot(data = world) +
geom_sf(colour=gray(.5),size=.05,fill=gray(.2))
mTRUE
Although we now have basic map data, we must adjust several parameters to develop meaningful visualizations of graphs with large numbers of edges.
The concept of centrality within networks attempts to identify the most important vertices or nodes within a network. By modeling our flight data as a graph, this enables us to identify the most important airports within the US based on graph metrics that account for more than simple counts.
We develop three additional features/columns in our airports data frame base on graph centrality measures and construct an interactive plot below.
Hover over each data point to identify the corresponding airport.
# remove airports that do not have flight data
airports=airports[ airports$iata %in% V(G)$name, ]
# add node metrics to airports data.frame
airports$ev_centrality=eigen_centrality(G,directed=TRUE,scale=TRUE)$vector
airports$degree=degree(G,v=V(G),mode='total')
airports$betweenness=betweenness(G,v=V(G),directed=TRUE,weights=E(G)$weight,normalized=TRUE)
# construct interactive plot
f <- list(
family = "Courier New, monospace",
size = 18,
color = "#7f7f7f")
x <- list(
title = "eigenvector centrality",
titlefont = f)
y <- list(
title = "betweenness",
titlefont = f)
p <- plot_ly(type = 'scatter', mode = 'markers') %>%
add_trace(
x = airports$ev_centrality,
y = airports$betweenness,
size=airports$degree/max(airports$degree)+.25,
text = airports$airport,
hoverinfo = 'text',
marker = list(color='#a8262b',
line = list(
color = 'black',
width = 1)),
showlegend = F) %>%
layout(xaxis = x, yaxis = y)
pTRUE
It is also interesting to look at nodes with anomalous characteristics. In the visualization below we use mahanolobis distance based on node betweenness and eigenvector centrality to idenfity nodes that have unusual characteristics. The y-axis in the graphic below corresponds to mahanolobis distance of our airports eigenvector and betweenness centrality measures. If we think of our airport’s features as a point p in 2D space where x corresponds to eigenvector centrality, and y corresponds to betweenness centrality. Mahalanobis distance measures the distance between an airport’s point p and the center of our entire data set’s distribution. Introduced by P. C. Mahalanobis in 1936, it is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D. You can see a clear parabolic trend in the visualization. Points above this trendline are worthy of manual inspection.
TRUE
Now we look to visualize our network geospatially. To do so we must adjust the appearance of our map layer to enable effective visualization and ensure our arcs are scaled appropriately for effective visualization.
Visualizing geospatial networks at scale requires strong contrast between the map layer and the network layer of the visualization. For this purpose we provide a setGeoNetworkTheme function that provides a dark map with subtle boundaries.
Large scale in terms of the number of arcs to be plotted requires auto bounding of the map layer. For this purpose we provide a getMapBoundaries function to easily enable the analyst to set a buffer margin and use the data to define the plot region.
#construct map w/rnaturalearth and sf
setGeoNetworkTheme()
map_boundaries=getMapBoundaries(arc_data,buffer=0.05)
m=ggplot(data = world) +
# define map boundaries
xlim(map_boundaries$lon[1],map_boundaries$lon[2])+
ylim(map_boundaries$lat[1],map_boundaries$lat[2])+
# plot map layer
geom_sf(colour=gray(.5),size=.05,fill=gray(.2))+
# add arcs
geom_line(aes(x=lon,y=lat,group=index),
size=1,
color='green1',
data=arc_data)+
theme(legend.position = 'none')TRUE
Finally we must adjust the width and transparancy of each arc to highlight structure within the plot. To do so we adjust parameters within the geom_line portion below by adding the ‘alpha’ parameter for transparency as a function of the count per arc (i.e. passenger volume). We also adjust the size parameter of the geom_line function. Finally, we use the geom_point function to plot points for all of the airports within our data and adjust for scale based on Mahalanobis Distance.
m=ggplot(data = world) +
xlim(map_boundaries$lon[1],map_boundaries$lon[2])+
ylim(map_boundaries$lat[1],map_boundaries$lat[2])+
geom_sf(colour=gray(.5),size=.05,fill=gray(.2))+
geom_line(aes(x=lon,y=lat,group=index,alpha=weight),
size=.05,
color='green1',
data=arc_data)+
geom_point(aes(x=long,y=lat,size=mahalanobis),color='cyan',
data=airports)+
scale_size(range=c(.005,1))+
theme(legend.position = 'none')TRUE
In this tutorial we have walked through how to generate geospatial network visualizations in R using a few simpe wrappers and data provided in the skopeABI package. We hope you find this tutorial useful, and welcome feedback and contributions to our open development project.
“Converting Data Between Wide and Long Format.” 2019. Accessed December 31. http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/.
“Create Air Travel Route Maps in Ggplot: A Visual Travel Diary. R-Bloggers.” 2017. March 6. https://www.r-bloggers.com/create-air-travel-route-maps-in-ggplot-a-visual-travel-diary/.
Yau, Nathan. 2015. “Thanksgiving Flight Patterns. FlowingData.” November 25. https://flowingdata.com/2015/11/25/thanksgiving-flight-patterns/.